%matplotlib inline
import matplotlib.pyplot as plt
# Optionally set a dark style for every figure in the notebook
#plt.style.use('dark_background')
Author: Jose G. Chavez
In our effort to understand and address the concerning issue of suicide rates, we turn to data analysis and forecasting as powerful tools. Suicide rates are a critical public health concern that necessitates our attention and thoughtful interventions. By utilizing time series analysis, we embark on a journey to forecast these rates, aiming to uncover insights that could inform prevention strategies.
**Uncovering Patterns in the Data.** The initial step involves recognizing patterns hidden within the data. Our dataset provides figures for suicides per 100,000 individuals, urging us to explore further. We narrow our focus to the United States, a country grappling with this complex issue. By examining the data, we aim to reveal stories behind the statistics. The 'year' column becomes our temporal dimension, and 'suicides/100k pop' guides our analysis.
**Utilizing Advanced Techniques.** Our analysis employs three techniques: Seasonal Decomposition, Holt-Winters Exponential Smoothing, and Linear Regression.
- **Seasonal Decomposition:** dissects the data into its fundamental components (trend, seasonality, and residuals), revealing recurring patterns and the cyclic nature of suicide rates.
- **Holt-Winters Exponential Smoothing:** a time series forecasting method that predicts future suicide rates while accounting for trend and seasonality. It is particularly effective for capturing short- to medium-term fluctuations.
- **Linear Regression:** a widely used predictive model that quantifies the relationship between 'year' and 'suicides/100k pop', letting us extrapolate the trajectory of suicide rates.
**Implications for Prevention.** Our ultimate aim is to understand the trajectory of suicide rates and apply these insights to prevention efforts. By incorporating findings from our analysis, we hope to contribute to evidence-based strategies that mitigate the impact of suicide. Through rigorous analysis, we strive to make a positive impact and illuminate a path towards a future with reduced suicide rates.
As we navigate this analysis, we emphasize both the technical rigor of our methods and the potential for meaningful change. By forecasting and analyzing data, we aspire to guide interventions that address the pressing issue of suicide, fostering a safer and more supportive environment for all.
## Unveiling Patterns in the Darkness
The first step is to recognize the patterns and rhythms within the data. The dataset, a collection of poignant figures representing suicides per 100,000 individuals, beckons us to delve deeper. We narrow our focus to the United States, a nation grappling with this complex issue despite the vast wealth it holds, concentrated in a shrinking segment of the population. By sifting through the data, we uncover not just numbers but stories of lives impacted. The 'year' column becomes our time frame, and 'suicides/100k pop' becomes our focus: the average number of suicides per 100 thousand people regardless of age, race, or gender.
import pandas as pd
x = pd.read_csv('kaggleRussellySUI.csv')
x.head(3).style.set_properties(**{'background-color': 'white',
'color': 'blue','font-size':'10px',
'border-color': 'white'})
| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.710000 | Albania1987 | nan | 2,156,624,900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.190000 | Albania1987 | nan | 2,156,624,900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.830000 | Albania1987 | nan | 2,156,624,900 | 796 | Generation X |
## Unmasking the Heartbeat of Time
We begin to see the heartbeat of time as we calculate the average suicide rates per year. Each year carries a story, and these averages reflect the collective pain and struggle that society grapples with. By capturing the essence of each year's suffering, we set the stage for further analysis and understanding.
After some Pandas magic we pull the data from the United States and we find the average rate per 100 thousand across all genders and ages:
# Keep only the United States rows
x = x.loc[x['country'] == 'United States']
x.head(3)
x.columns
#x.to_csv('kaggle_united_states_suicides.csv')

# Keep just the columns needed for the time series
y = x[['country', 'year', 'suicides/100k pop']]
y.head()

# Average rate per 100k across all ages and sexes, by year
averages_by_year = y.groupby('year')['suicides/100k pop'].mean()
#print(averages_by_year)
## Prepare for Seasonal Decomposition
y.index
y.columns
z = averages_by_year

# Create a new DataFrame to store the averages, indexed by datetime
averages_df = pd.DataFrame({'Average Suicides/100k pop': averages_by_year.values},
                           index=pd.to_datetime(averages_by_year.index, format='%Y'))
# Set the frequency to yearly (year-start)
averages_df.index.freq = 'YS'
#print(averages_df)
#print(averages_df.index)

# Save the yearly averages; keep the filename consistent with the read-back below
z.to_csv('KEEP_SUI_DATA.csv')
#sui_d = pd.read_csv('KEEP_SUI_DATA.csv')
y.head(5).style.set_properties(**{'background-color': 'white',
'color': 'blue','font-size':'10px',
'border-color': 'white'})
| | country | year | suicides/100k pop |
|---|---|---|---|
| 26848 | United States | 1985 | 53.570000 |
| 26849 | United States | 1985 | 29.500000 |
| 26850 | United States | 1985 | 24.460000 |
| 26851 | United States | 1985 | 22.770000 |
| 26852 | United States | 1985 | 21.380000 |
import pandas as pd
# A small demo DataFrame used to test the table styling
data = {'Column1': [1, 2, 3, 4, 5],
'Column2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Style the DataFrame
styled_df = df.head(5).style.set_properties(**{
'background-color': 'white',
'color': 'blue',
'font-size': '10px',
'border-color': 'white'
})
print(df.head())
# Adjust the table width
styled_df = styled_df.set_table_styles([
{'selector': 'table', 'props': [('max-width', '200px')]} # Adjust the width as needed
])
# Display the styled DataFrame
styled_df
#print(styled_df.head())
   Column1  Column2
0        1        6
1        2        7
2        3        8
3        4        9
4        5       10
| | Column1 | Column2 |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 2 | 7 |
| 2 | 3 | 8 |
| 3 | 4 | 9 |
| 4 | 5 | 10 |
type(styled_df)
pandas.io.formats.style.Styler
## Decomposing in Preparation for Forecasting
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# Create a new DataFrame to store the averages
z = pd.DataFrame({'Average Suicides/100k pop': averages_by_year.values},
index=pd.to_datetime(averages_by_year.index, format='%Y'))
# Set the frequency to yearly
z.index.freq = 'YS'
# Apply seasonal decomposition
result = seasonal_decompose(z['Average Suicides/100k pop'], model='multiplicative')
# Plot the decomposition. result.plot() creates its own figure, so we
# resize that figure rather than opening an empty one with plt.figure()
fig = result.plot()
fig.set_size_inches(10, 6)
plt.show()
# Training and Performance
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.model_selection import TimeSeriesSplit
# Create a new DataFrame to store the averages
z = pd.DataFrame({'Average Suicides/100k pop': averages_by_year.values},
index=pd.to_datetime(averages_by_year.index, format='%Y'))
# Set the frequency to yearly
z.index.freq = 'YS'
# Split the data into training and testing sets
train_size = int(len(z) * 0.8) # 80% for training
train, test = z[:train_size], z[train_size:]
# Apply Holt-Winters to the training data. The series has one value per
# year, so there is no intra-year seasonality to model; an additive
# trend alone is the appropriate specification here.
model = ExponentialSmoothing(train['Average Suicides/100k pop'], trend='add')
result = model.fit()
# Forecast for the testing period
forecast = result.forecast(steps=len(test))
# Plot the original training data, testing data, and the forecast
plt.figure(figsize=(8, 4))
plt.plot(train.index, train['Average Suicides/100k pop'], label='Training Data')
plt.plot(test.index, test['Average Suicides/100k pop'], label='Testing Data', color='orange')
plt.plot(test.index, forecast, label='Forecast', color='red')
plt.legend()
# Save the figure before plt.show(); calling savefig afterwards writes
# out an empty canvas
plt.savefig("sample_plot.png", format="png")
plt.show()
## Forecasting
**Forecasting Hope: A Glimpse of What Lies Ahead.** With a heavy heart, we now turn our gaze towards the future. Could understanding the past help us anticipate the future? We embark on a forecasting journey, employing the Holt-Winters method. This technique acknowledges the intricacies of time: seasonal ebbs and flows, and the gradual evolution of trends. We divide the dataset into training and testing segments, where the past becomes our guide to predict the years to come.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Apply Holt-Winters to the full yearly series (additive trend, no
# seasonal term: there is one observation per year)
model = ExponentialSmoothing(z['Average Suicides/100k pop'], trend='add')
result = model.fit()
# Forecast for the next 5 years
forecast = result.forecast(steps=5)
# Plot the original data and the forecast
plt.figure(figsize=(6, 4))
plt.plot(z.index, z['Average Suicides/100k pop'], label='Original Data')
plt.plot(forecast.index, forecast, label='Forecast', color='red')
plt.legend()
plt.show()
## Regressions
z.head()
| year | Average Suicides/100k pop |
|---|---|
| 1985-01-01 | 15.393333 |
| 1986-01-01 | 15.970833 |
| 1987-01-01 | 15.971667 |
| 1988-01-01 | 15.642500 |
| 1989-01-01 | 15.203333 |
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# The DataFrame `z` of yearly averages is defined above
# Create a copy of the DataFrame for regression
z_for_regression = z.copy()
# Reset the index for regression
z_reset = z_for_regression.reset_index()
# Calculate numerical years since the minimum year
z_reset['numerical_year'] = (z_reset['year'] - z_reset['year'].min()) / pd.Timedelta(days=365.25)
# Prepare the features (X) and target variable (y)
X = z_reset[['numerical_year']]
y = z_reset['Average Suicides/100k pop']
# Split the data into training and testing sets (note: a random split
# ignores temporal order; TimeSeriesSplit would preserve it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and fit the linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions using the model on the testing data
y_pred = model.predict(X_test)
# Calculate Mean Squared Error (MSE) as a measure of model performance
mse = mean_squared_error(y_test, y_pred)
# Visualize the data and regression line
plt.figure(figsize=(6, 4))
plt.scatter(z_reset['year'], z_reset['Average Suicides/100k pop'], label='Actual Data')
plt.plot(z_reset['year'], model.predict(X), color='red', label='Regression Line')  # fitted line over the full range
plt.xlabel('Year')
plt.ylabel('Average Suicides/100k pop')
plt.title('Regression Analysis: Average Suicides Over Time')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Print the regression coefficients
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])
# Print the Mean Squared Error
print("Mean Squared Error:", mse)
Intercept: 15.336419359510753
Coefficient: -0.10321630861360602
Mean Squared Error: 1.3044221658313542
As we can see, the regression line is far less accurate: a single straight line cannot follow the changes in direction the series takes over the decades, and its mean squared error of about 1.30 reflects that.
## Charting a Course Towards Prevention
As we conclude our exploration, we're reminded of the gravity of the challenge before us. Suicides are not mere numbers; they are lives intertwined with stories, hopes, and struggles. Through the lens of data and forecasting, we inch closer to understanding the ebbs and flows of this issue. Armed with this knowledge, policymakers, mental health advocates, and communities can forge a path towards prevention.
**Technical Takeaways: Unveiling Hidden Connections.** Beyond the emotional impact, our analysis has technical takeaways. We're poised to delve further by examining correlations between suicide rates and the changing technological landscape. With advancements reshaping the way we connect, communicate, and experience life, we hypothesize that correlations might reveal previously hidden connections. By unearthing these insights, we can contribute to comprehensive strategies for prevention.
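Mechanically, that follow-up would amount to joining the yearly averages with an external indicator on a shared year index and computing a correlation. The sketch below is purely hypothetical: both series are made up, and the real analysis would replace them with `averages_by_year` and an actual technology-adoption dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: correlating yearly suicide rates with a
# technology-adoption indicator. Both series are invented placeholders.
years = range(2000, 2016)
rates = pd.Series(np.linspace(12.0, 15.5, 16), index=years)
adoption = pd.Series(np.linspace(0.0, 0.8, 16), index=years)

frame = pd.DataFrame({'suicides_100k': rates, 'tech_adoption': adoption})

# Pearson correlation matrix; corr() aligns the columns on the shared
# index. The toy series are both perfectly linear in the year, so their
# correlation comes out as 1.0 by construction.
print(frame.corr())
```

Of course, correlation over a shared time index mostly reflects shared trends, so any real finding here would need careful controls before being read as a connection.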
!open .
!jupyter nbconvert --to html --template pj Suicide\ Yearly\ Data\ United\ States.ipynb
[NbConvertApp] Converting notebook Suicide Yearly Data United States.ipynb to html
[NbConvertApp] Writing 5908179 bytes to Suicide Yearly Data United States.html